Abstract

Ancestral maximum likelihood (AML) is a method that simultaneously 
reconstructs a phylogenetic tree and ancestral sequences from extant data 
(sequences at the leaves). The tree and ancestral sequences maximize the 
probability of observing the given data under a Markov model of sequence 
evolution, in which branch lengths are also optimized but constrained to 
take the same value on any edge across all sequence sites. AML differs 
from the more usual form of maximum likelihood (ML) in phylogenetics 
because ML averages over all possible ancestral sequences. ML has long
been known to be statistically consistent - that is, it converges on the correct
tree with probability approaching 1 as the sequence length grows. However,
the statistical consistency of AML has not been formally determined, despite 
informal remarks in a literature that dates back 20 years. In this short note 
we prove a general result that implies that AML is statistically inconsistent. 
In particular we show that AML can 'shrink' short edges in a tree, resulting 
in a tree that has no internal resolution as the sequence length grows. Our 
results apply to any number of taxa. 
1 Introduction



Markov models of site substitution in DNA are the basis for most methods for
inferring phylogenies (evolutionary trees) from aligned sequence data. The usual
approach is maximum likelihood (ML), which seeks the tree and branch lengths
that maximize the probability of generating the observed data under a Markov
process. In the simplest setting one assumes that sites evolve independently and
identically, and that the extant sequences (data) label the leaves of the tree - for
background on phylogenetics and ML see [0. ML is computationally complicated,
and even the problem of finding the optimal branch lengths exactly on a fixed
tree has unknown complexity. In ML one considers all possible ancestral
sequences that could have existed within the tree, and weights each such
'scenario' by its probability. An alternative is to consider only a single choice of
ancestral sequences that has the highest probability - this is a variant of ML that
was introduced in 1987 by Barry and Hartigan [3] under the name 'most
parsimonious likelihood', and which was later renamed ancestral maximum
likelihood (AML) (see e.g. (HI). AML is computationally somewhat easier than ML,
in that given the tree and either the optimal branch lengths or the optimal
ancestral sequences, the other 'unknown' (ancestral sequences or branch lengths)
is readily determined (see e.g. [0). The method can be viewed as being, in some
sense, intermediate between ML and a primitive cladistic method, maximum
parsimony (MP), which seeks the tree and ancestral sequences that minimize the
total number of site substitutions required to describe the data. Indeed, AML
would select the same trees as MP if one further constrained AML so that each
edge had the same branch length, as shown in IflOll .

The recent interest in AML has sprung from computational complexity
considerations. Initially, AML seemed to provide a promising route by which to
show that the problem of reconstructing an ML tree from sequences is NP-hard
[TJO. It turned out that the NP-hardness of ML can be established directly,
without invoking AML |fT5l ; however, the relative computational simplicity of
AML over ML suggests it may provide an alternative strategy for reconstructing
large trees.

Nevertheless, it is important to know whether the desirable statistical
properties of ML carry over to methods such as AML. In particular, ML has long
been known to be statistically consistent as a way of estimating tree topologies -
that is, as the sequence length grows, the probability that ML will reconstruct the
tree that generated the sequences tends to 1. It has also been known (since 1978)
that more primitive methods, such as MP, can be statistically inconsistent (HI.

However, the statistical consistency of AML is unclear, since the standard
Wald-style conditions required to prove consistency (in particular a fixed
parameter space that does not grow with the size of the data) do not apply. Thus,
one may suspect that AML might be inconsistent, and indeed remarks in the
literature have suggested this could be the case (see flU, |[TT10 . However, the
absence of a sufficient condition to prove consistency does not constitute proof
of inconsistency, and the purpose of this short note is to formally show that AML
is statistically inconsistent. More precisely, we show that AML tends to 'shrink'
short edges in a tree, and this can result in the collapse of the interior edges (and
any short pendant edges) to produce a star tree.

The results in this paper rely on probability arguments, based on expansions 
of the entropy function, and combinatorial properties of minimal sets of edges that 
separate each pair of leaves in a tree. 



1.1 Problem Statement 

CFN model We define $[n] = \{0, \ldots, n - 1\}$ and we deal with the
Cavender-Farris-Neyman (CFN) model H51I71H51.

Definition 1 (CFN model) We are given a tree $T = (V, E)$ on $n$ leaves labelled
by $[n]$ and an assignment of edge probabilities $p : E \to (0, 1/2)$. A realization
of the model is obtained as follows: choose any vertex as a root; pick a state for
the root uniformly at random in $\{0, 1\}$; moving away from the root, the state
at the far end of each edge $e$ is a copy of the state at its near end, flipped with
probability $p_e$ independently of all other edges. We denote by $X$ the (random)
state at the leaves obtained in this manner. We write $X \sim \mathrm{CFN}(T, p)$.
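
As a rough illustration of Definition 1, the following Python sketch draws one character from the CFN model. The adjacency-dictionary representation, the function name `sample_cfn`, and the quartet tree with internal vertices "a" and "b" are assumptions made here for illustration; they do not come from the original text.

```python
import random

def sample_cfn(adj, flip_prob, root, rng):
    """Draw one realization of CFN(T, p) and return a dict vertex -> state.

    adj       : dict mapping each vertex to a list of its neighbours (the tree T)
    flip_prob : dict mapping frozenset({u, v}) to the edge probability p_e
    root      : any vertex of the tree
    rng       : a random.Random instance
    """
    state = {root: rng.randrange(2)}          # uniform root state in {0, 1}
    stack = [root]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in state:                # v lies further from the root than u
                flipped = rng.random() < flip_prob[frozenset((u, v))]
                state[v] = state[u] ^ int(flipped)
                stack.append(v)
    return state

# Hypothetical quartet tree: leaves 0..3, internal vertices "a" and "b".
adj = {0: ["a"], 1: ["a"], 2: ["b"], 3: ["b"],
       "a": [0, 1, "b"], "b": [2, 3, "a"]}
p = {frozenset(e): 0.1
     for e in [(0, "a"), (1, "a"), ("a", "b"), (2, "b"), (3, "b")]}

rng = random.Random(0)
full = sample_cfn(adj, p, root="a", rng=rng)
X = {v: full[v] for v in range(4)}            # the observed character at the leaves
```

Repeating the call $k$ times with the same `rng` gives $k$ independent characters, that is, one sequence of length $k$ per leaf.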



Ancestral Maximum Likelihood We consider two equivalent formulations of
the Ancestral Maximum Likelihood problem. The second version is obtained by
setting

$$p_e = \frac{d_e}{k} \qquad (1)$$

for all $e$ in the first version fll), with $d_e$ as in Definition 2 below.

Definition 2 (AML, Version 1) The Ancestral Maximum Likelihood (AML)
problem can be stated as follows. Given a set $S$ of $n$ binary sequences of length
$k$, find a tree $T = (V, E)$ on $n$ leaves, an assignment $p : E \to [0, 1/2]$ of edge
probabilities, and an assignment of sequences $\lambda : V \to \{0, 1\}^k$ to the
vertices such that:

1. The sequences at the leaves under $\lambda$ are exactly the sequences from $S$;

2. The quantity

$$\mathcal{L}(T, p \mid \lambda) = -\log_2 \Big( \prod_{e \in E} p_e^{d_e} (1 - p_e)^{k - d_e} \Big)$$

is minimized, where, for $e = (u, v)$, $d_e$ is the number of sites at which the
sequences $\lambda_u$ and $\lambda_v$ differ (their Hamming distance).

Definition 3 (AML, Version 2 [1]) The Ancestral Maximum Likelihood (AML)
problem can alternatively be stated as follows. Given a set $S$ of $n$ binary
sequences of length $k$, find a tree $T$ on $n$ leaves and an assignment of
sequences $\lambda : V \to \{0, 1\}^k$ to the vertices such that:

1. The sequences at the leaves under $\lambda$ are exactly the sequences from $S$;

2. The quantity

$$\mathcal{H}(T \mid \lambda) = \sum_{e \in E} H\Big(\frac{d_e}{k}\Big)$$

is minimized, where recall that the entropy function is

$$H(p) = -p \log_2 p - (1 - p) \log_2 (1 - p),$$

for $0 \le p \le 1$ (with the convention $0 \log_2 0 = 0$).
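
The link between the two versions can be checked numerically: for fixed ancestral sequences, minimizing $-\log_2 \prod_e p_e^{d_e}(1-p_e)^{k-d_e}$ over each $p_e$ gives $p_e = d_e/k$, and the minimum equals $k \sum_e H(d_e/k)$. The sketch below verifies this on a toy example. The function names, the three-leaf star tree, and the sequences are hypothetical choices made for illustration only.

```python
import math

def entropy(q):
    """Binary entropy H(q) in bits, with H(0) = H(1) = 0."""
    if q <= 0.0 or q >= 1.0:
        return 0.0
    return -q * math.log2(q) - (1 - q) * math.log2(1 - q)

def hamming(a, b):
    """Number of sites at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def version1_objective(edges, seqs, p):
    """-log2 of prod_e p_e^{d_e} (1 - p_e)^{k - d_e}, skipping zero exponents."""
    k = len(seqs[edges[0][0]])
    total = 0.0
    for (u, v) in edges:
        d = hamming(seqs[u], seqs[v])
        if d > 0:
            total -= d * math.log2(p[(u, v)])
        if k - d > 0:
            total -= (k - d) * math.log2(1 - p[(u, v)])
    return total

def version2_objective(edges, seqs):
    """Sum over edges of H(d_e / k)."""
    k = len(seqs[edges[0][0]])
    return sum(entropy(hamming(seqs[u], seqs[v]) / k) for (u, v) in edges)

# Hypothetical data: three leaves attached to one internal vertex "a", k = 4.
edges = [(0, "a"), (1, "a"), (2, "a")]
seqs = {0: "0011", 1: "0101", 2: "0001", "a": "0001"}
k = 4

p_hat = {(u, v): hamming(seqs[u], seqs[v]) / k for (u, v) in edges}
at_optimum = version1_objective(edges, seqs, p_hat)
scaled_entropy = k * version2_objective(edges, seqs)
elsewhere = version1_objective(edges, seqs, {e: 0.3 for e in edges})
```

Here `at_optimum` and `scaled_entropy` agree up to rounding, while any other choice of edge probabilities, such as the constant 0.3 used for `elsewhere`, gives a larger Version 1 objective.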

Consistency A phylogeny estimator $\Phi = \{\Phi_n^{(k)}\}_{n, k \ge 1}$ is a
collection of mappings from sequences to trees, that is,

$$\Phi_n^{(k)} : \mathcal{B}_n^{(k)} \to \mathbb{T}_n,$$

where $\mathcal{B}_n^{(k)}$ is the set of all assignments of the form

$$\mathcal{B}_n^{(k)} = \{\mu \mid \mu : [n] \to \{0, 1\}^k\},$$

and $\mathbb{T}_n$ is the set of all trees on $n$ leaves labelled by $[n]$. Let
$X = \{X_1, X_2, \ldots\}$ with $X_j : [n] \to \{0, 1\}$ for $j \ge 1$. For all
$k \ge 1$, we denote by $\mu = \mu_X^{(k)}$ the assignment in
$\mathcal{B}_n^{(k)}$ such that $(\mu_u)_j = (X_j)_u$ for all $u \in [n]$ and
$j = 1, \ldots, k$.

Definition 4 (Consistency) A phylogeny estimator $\Phi$ is said to be
(statistically) consistent if for all $n$, all trees $T = (V, E) \in \mathbb{T}_n$, and
all edge probability assignments $p : E \to (0, 1/2)$, it holds that

$$\Phi_n^{(k)}\big(\mu_X^{(k)}\big) \to T$$

almost surely as $k \to +\infty$, where $X = \{X_1, X_2, \ldots\}$ with
$X_1, X_2, \ldots$ independently generated by $\mathrm{CFN}(T, p)$.
1.2 Main Result

Let $\Phi_{\mathrm{AML}}$ be the AML phylogeny estimator for AML Version 1,
where all edges $e$ with $p_e = 0$ have been contracted and all edges $e$ with
$p_e = 1/2$ have been removed. (Break ties arbitrarily.)

Theorem 1 (AML Is Not Consistent) For all $n > 1$ and each tree
$T = (V, E) \in \mathbb{T}_n$, there is a $\beta > 0$ and a shrinkage zone
$\Omega_T = \prod_{e \in E} I_e$ such that $|I_e| > \beta$ for all $e$ and, if
$p \in \Omega_T$, $\Phi_{\mathrm{AML}}$ returns a star rooted at 0 in the limit
$k \to +\infty$ on dataset $X = \{X_1, X_2, \ldots\}$ with $X_1, X_2, \ldots$
independently generated by $\mathrm{CFN}(T, p)$.

The phenomenon described in Theorem 1 is illustrated in Fig. 1. We note that our
result does not imply the stronger statement that AML is "positively misleading"
since we can think of the rooted star as the correct tree $T$ where several edges
are set to $p_e = 0$. Note however that the solution is highly degenerate since the
star can be obtained in this way from any tree. In other words, in the shrinkage
zone, AML provides no information about the internal structure of the tree even
with infinitely long sequences.

Figure 1: The shrinkage effect: For the tree on the left, AML will reconstruct the
star tree (right) from sufficiently long sequences.

1.3 Organization

We begin with some preliminary remarks in Section 2. The proof of Theorem 1
can be found in Section 3.
2 Preliminaries



2.1 Solution Properties 

Fixed Extension Let $T \in \mathbb{T}_n$. For an assignment of sequences
$\mu \in \mathcal{B}_n^{(k)}$ and $1 \le j \le k$, we call
$\chi : [n] \to \{0, 1\}$ with $\chi_u = (\mu_u)_j$ for all $u \in [n]$ the $j$-th
character in $\mu$. We write $\chi \in \mu$ if there is $j$ such that $\chi$ is the
$j$-th character in $\mu$. We also denote by $\chi_\#$ the number of characters
in $\mu$ equal to $\chi$. An extension of a character $\chi$ is a mapping
$\hat\chi : V \to \{0, 1\}$ such that $\hat\chi_v = \chi_v$ for all $v \in [n]$. We
denote by $\mathcal{V}(\chi)$ the set of all extensions of $\chi$ on $T$. Let
$f : \{0, 1\}^{[n]} \to \{0, 1\}^{V - [n]}$. The mapping $f$ then defines an
extension for all characters simultaneously by setting $(\chi_f)_v = \chi_v$ for all
$v \in [n]$ and $(\chi_f)_v = f(\chi)_v$ for all $v \in V - [n]$. We show next that
AML is in fact equivalent to finding such an $f$, which can significantly reduce
the size of the problem for large $k$. For a set of $n$ binary sequences
$\mu \in \mathcal{B}_n^{(k)}$ and a tree $T = (V, E) \in \mathbb{T}_n$, we denote
by $\hat\mu_f$ the extension of $\mu$ to $V$ obtained by applying $f$ as above to
every character in $\mu$.

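Definition 5 (AML, Version 3) Given a set of $n$ binary sequences
$\mu \in \mathcal{B}_n^{(k)}$, find a tree $T \in \mathbb{T}_n$ and a mapping
$f : \{0, 1\}^{[n]} \to \{0, 1\}^{V - [n]}$ such that the quantity

$$\mathcal{H}(T \mid \hat\mu_f) = \sum_{e = (u, v) \in E} H\Big(\frac{1}{k} \big\| (\hat\mu_f)_u - (\hat\mu_f)_v \big\|_1\Big)$$

is minimized, where $\|\cdot\|_1$ counts the sites at which the two sequences
differ.

Proposition 1 (AML, Version 3) There is always a solution of AML Versions 1
and 2 of the form $\lambda = \hat\mu_f$ for some
$f : \{0, 1\}^{[n]} \to \{0, 1\}^{V - [n]}$.

Proof: Note that

$$\mathcal{L}(T, p \mid \lambda) = -\sum_{j=1}^{k} \sum_{e = (u, v) \in E} \Big[ \mathbb{1}\{(\lambda_u)_j \ne (\lambda_v)_j\} \log_2 p_e + \big(1 - \mathbb{1}\{(\lambda_u)_j \ne (\lambda_v)_j\}\big) \log_2 (1 - p_e) \Big]$$

$$= -k \sum_{e \in E} \log_2 (1 - p_e) - \sum_{j=1}^{k} \sum_{e = (u, v) \in E} \mathbb{1}\{(\lambda_u)_j \ne (\lambda_v)_j\} \log_2 \frac{p_e}{1 - p_e}.$$

For fixed $p$, since $\mathcal{L}$ "decomposes" in $j$ (it is a sum of one term per
site), an optimal extension can be chosen site by site, so it is always possible to
take the same extension for every occurrence of a given character in $\mu$
without affecting optimality. Then, we can choose the optimal $p$ as in [HI to
obtain the result. ■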
Limit Problem Let $T = (V, E) \in \mathbb{T}_n$. Assume as in Theorem 1 that
we are given a dataset $X = \{X_1, X_2, \ldots\}$ with $X_1, X_2, \ldots$ i.i.d.
$\mathrm{CFN}(T, p)$. Fix $f : \{0, 1\}^{[n]} \to \{0, 1\}^{V - [n]}$. Let
$X \sim \mathrm{CFN}(T, p)$ and denote by $Y = X_f$ the extension of $X$ under
$f$. Also, let $\hat\mu_{X, f}^{(k)}$ be the extension of $\mu_X^{(k)}$ under $f$. By
the Law of Large Numbers, as $k \to +\infty$, the quantity
$\mathcal{H}(T \mid \hat\mu_{X, f}^{(k)})$ converges almost surely to

$$\mathbb{H}_{X, T}(f) = \sum_{e \in E} H(Y_e),$$

where, for $e = (u, v)$, $Y_e$ is the indicator that $Y_u \ne Y_v$, and $H(Y_e)$ is
the entropy of $Y_e$, that is,

$$H(Y_e) = H(\mathbb{P}[Y_u \ne Y_v]).$$

Note that, by Proposition 1, even as $k \to +\infty$ there are only a constant
number of mappings $f$ to consider. We say that $f$ is
$\mathbb{H}_{X, T}$-optimal if $f$ minimizes $\mathbb{H}_{X, T}(f)$ over all
$f : \{0, 1\}^{[n]} \to \{0, 1\}^{V - [n]}$. The minimum need not be unique.

Definition 6 (Expected AML) Given a random variable $X$ taking values in
$\{0, 1\}^{[n]}$, find a tree $T = (V, E) \in \mathbb{T}_n$ and a mapping
$f : \{0, 1\}^{[n]} \to \{0, 1\}^{V - [n]}$ such that the quantity

$$\mathbb{H}_{X, T}(f) = \sum_{e \in E} H(Y_e)$$

is minimized, where $Y = X_f$.
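
For small trees the limit objective, the sum over edges of $H(\mathbb{P}[Y_u \ne Y_v])$ with $Y = X_f$, can be evaluated exactly by enumerating all vertex states. The sketch below does this for a hypothetical quartet tree and for the mapping that copies the state of leaf 0 to every internal vertex. All function names and the example tree are assumptions made for illustration, and the boundary value $p_e = 1/2$ is used only as a limiting case in which the leaf states are uniform.

```python
import itertools
import math

def entropy(q):
    """Binary entropy in bits, with H(0) = H(1) = 0."""
    if q <= 0.0 or q >= 1.0:
        return 0.0
    return -q * math.log2(q) - (1 - q) * math.log2(1 - q)

def leaf_distribution(vertices, edges, p, n):
    """Exact law of the leaf states (leaves 0..n-1) under CFN(T, p)."""
    dist = {}
    for states in itertools.product((0, 1), repeat=len(vertices)):
        s = dict(zip(vertices, states))
        prob = 0.5                                   # uniform root state
        for (u, v) in edges:
            prob *= p[(u, v)] if s[u] != s[v] else 1 - p[(u, v)]
        key = tuple(s[i] for i in range(n))
        dist[key] = dist.get(key, 0.0) + prob        # sum out internal vertices
    return dist

def limit_objective(dist, edges, internal, f):
    """Sum over edges (u, v) of H(P[Y_u != Y_v]), where Y extends X through f."""
    disagree = {e: 0.0 for e in edges}
    for x, prob in dist.items():
        y = dict(enumerate(x))                       # leaf states
        y.update(zip(internal, f(x)))                # internal states from f
        for (u, v) in edges:
            if y[u] != y[v]:
                disagree[(u, v)] += prob
    return sum(entropy(q) for q in disagree.values())

def rooted_star(x):
    """Assign the state of leaf 0 to both internal vertices."""
    return (x[0], x[0])

vertices = [0, 1, 2, 3, "a", "b"]
edges = [(0, "a"), (1, "a"), ("a", "b"), (2, "b"), (3, "b")]
p_uniform = {e: 0.5 for e in edges}                  # limiting case: uniform leaves
dist = leaf_distribution(vertices, edges, p_uniform, n=4)
value = limit_objective(dist, edges, internal=("a", "b"), f=rooted_star)
```

With uniform leaf states, only the three pendant edges at leaves 1, 2 and 3 ever see a disagreement under `rooted_star`, each with probability 1/2, so `value` equals $n - 1 = 3$.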

By the previous remarks and (1), to prove Theorem 1 it suffices to show:

Theorem 2 (Optimal Assignment) Let $T' = (V', E') \in \mathbb{T}_n$ and let
$X \sim \mathrm{CFN}(T', p)$. Then there is a $\beta > 0$ and a shrinkage zone
$\Omega_{T'} = \prod_{e \in E'} I_e$ such that $|I_e| > \beta$ for all $e$ and, for
all $p \in \Omega_{T'}$ and all $T = (V, E) \in \mathbb{T}_n$, the unique
$\mathbb{H}_{X, T}$-optimal $f : \{0, 1\}^{[n]} \to \{0, 1\}^{V - [n]}$ assigns to
all internal nodes of $T$ the value at leaf 0 under all characters, that is,

$$f(\chi) = (\chi_0, \ldots, \chi_0),$$

for all $\chi \in \{0, 1\}^{[n]}$.

2.2 Minimal Isolating Sets 

Definition In preparation for our proof of Theorem 2, we will need the following
notion which is studied in lfT2ll .

Definition 7 (Isolating Set) Let $T = (V, E)$ be a tree. A subset $S$ of $E$ is
called an isolating set for $T$ if for any two leaves $u, v$ there exists an edge
$e \in S$ on the path connecting $u$ and $v$.

The following result is proved in lfT2ll .

Proposition 2 (Minimal Isolating Set) The size of a minimal isolating set on an
$n$-leaf tree is $n - 1$.

We will also need: 

Proposition 3 (One Leaf Per Component) Let T be a tree on n leaves and let 
S be a minimal isolating set on T. Consider the forest F obtained from T by 
removing all edges in S. Then, each component of F contains exactly one leaf of 
T. 

Proof: If a component of F contains two leaves, then these cannot be isolated 
under S, a contradiction. On the other hand, if a component T' of F does not 
contain a leaf, then every edge adjacent to T' in T is in fact in S. But then one can 
remove one of these edges without losing the isolating property of S, contradicting 
the minimality of S. ■ 
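
Propositions 2 and 3 can be checked by brute force on a small example. The sketch below enumerates every inclusion-minimal isolating set of a hypothetical quartet tree and confirms that each has $n - 1 = 3$ edges. The tree, the vertex names and the function names are assumptions made for illustration.

```python
from itertools import combinations

def is_isolating(vertices, edges, leaves, S):
    """True if every path between two distinct leaves uses an edge of S.

    Equivalently: after deleting the edges of S, no component of the
    remaining forest contains two leaves.
    """
    parent = {v: v for v in vertices}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]     # path halving
            v = parent[v]
        return v

    for (u, v) in edges:
        if (u, v) not in S:
            parent[find(u)] = find(v)         # merge along surviving edges
    return len({find(leaf) for leaf in leaves}) == len(leaves)

def minimal_isolating_sets(vertices, edges, leaves):
    """All isolating sets from which no single edge can be dropped."""
    result = []
    for r in range(len(edges) + 1):
        for subset in combinations(edges, r):
            S = set(subset)
            if is_isolating(vertices, edges, leaves, S) and not any(
                is_isolating(vertices, edges, leaves, S - {e}) for e in S
            ):
                result.append(S)
    return result

# Hypothetical quartet tree with leaves 0..3 and internal vertices "a", "b".
vertices = [0, 1, 2, 3, "a", "b"]
edges = [(0, "a"), (1, "a"), ("a", "b"), (2, "b"), (3, "b")]
leaves = [0, 1, 2, 3]

minimal_sets = minimal_isolating_sets(vertices, edges, leaves)
sizes = {len(S) for S in minimal_sets}        # expected to contain only n - 1
```

Removing $n - 1$ edges from a tree leaves exactly $n$ components, which is consistent with each component containing exactly one leaf.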

Minimally Isolating $f$ Let $T = (V, E) \in \mathbb{T}_n$ and
$f : \{0, 1\}^{[n]} \to \{0, 1\}^{V - [n]}$. We denote by $S_f \subseteq E$ the set of
edges $e = (u, v)$ such that there is $\chi \in \{0, 1\}^{[n]}$ with
$f(\chi)_u \ne f(\chi)_v$.

Definition 8 (Minimally Isolating $f$) We say that $f$ is minimally isolating for
$T$ if $S_f$ is a minimal isolating set of $T$.

2.3 Random cluster parameterization 

We will sometimes require a different ('random cluster') parameterization of the
CFN model. Let $T \in \mathbb{T}_n$ and $p \in [0, 1]^E$. (Note that we allow
$p_e$ in $[0, 1]$.) We let

$$\theta_e = 1 - 2 p_e,$$

for all $e \in E$. The main property we will use is the following well-known
identity. For two leaves $u, v$ in $T$, let $\mathrm{Path}_T(u, v)$ be the set of
edges on the path between $u$ and $v$.
Proposition 4 (Path Probability) Let $T = (V, E) \in \mathbb{T}_n$ and
$p \in [0, 1]^E$. Assume $X \sim \mathrm{CFN}(T, p)$. Let $u, v$ be two leaves of
$T$. Then we have

$$1 - 2\,\mathbb{P}[X_u \ne X_v] = \prod_{e \in \mathrm{Path}_T(u, v)} \theta_e.$$
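
The path identity of Proposition 4, namely that $1 - 2\,\mathbb{P}[X_u \ne X_v]$ equals the product of $\theta_e = 1 - 2p_e$ over the edges on the path from $u$ to $v$, can be verified by exact enumeration on a small tree. In the sketch below the quartet tree, the edge probabilities and the hard-coded path from leaf 0 to leaf 2 are hypothetical values chosen for illustration.

```python
import itertools
import math

def disagreement_probability(vertices, edges, p, u, v):
    """Exact P[X_u != X_v] under CFN(T, p), by summing over all vertex states."""
    total = 0.0
    for states in itertools.product((0, 1), repeat=len(vertices)):
        s = dict(zip(vertices, states))
        prob = 0.5                                   # uniform root state
        for (a, b) in edges:
            prob *= p[(a, b)] if s[a] != s[b] else 1 - p[(a, b)]
        if s[u] != s[v]:
            total += prob
    return total

vertices = [0, 1, 2, 3, "a", "b"]
edges = [(0, "a"), (1, "a"), ("a", "b"), (2, "b"), (3, "b")]
p = {(0, "a"): 0.05, (1, "a"): 0.2, ("a", "b"): 0.1,
     (2, "b"): 0.3, (3, "b"): 0.45}
theta = {e: 1 - 2 * q for e, q in p.items()}

# Edges on the path between leaves 0 and 2 in this particular tree.
path_0_2 = [(0, "a"), ("a", "b"), (2, "b")]

lhs = 1 - 2 * disagreement_probability(vertices, edges, p, 0, 2)
rhs = math.prod(theta[e] for e in path_0_2)
```

The two quantities `lhs` and `rhs` agree up to floating-point rounding; the edges off the path, here those pendant at leaves 1 and 3, do not affect either side.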



3 Proof

In this section, we prove Theorem 2, from which Theorem 1 follows. The proof
has two components:

1. [Reduction to Minimal Isolating Sets] We first show that for any random
variable $X \in \{0, 1\}^{[n]}$ close enough to uniform and any tree
$T \in \mathbb{T}_n$, the $\mathbb{H}_{X, T}$-optimal $f$'s are minimally
isolating for $T$.

2. [Rooted Star is Optimal] Second, we show that if $X$ above is
$\mathrm{CFN}(T', p)$ for some $T' \in \mathbb{T}_n$ with $p_e \approx 1/2$ if
$e$ is adjacent to $\{1, \ldots, n - 1\}$ and $p_e \approx 0$ otherwise, then for
all $T \in \mathbb{T}_n$ the unique $\mathbb{H}_{X, T}$-optimal $f$ assigns the
value at 0 to all internal nodes.

Throughout, $n > 1$ is fixed.

3.1 Reduction to Minimal Isolating Sets

We prove the following:

Proposition 5 (Reduction to Minimal Isolating Sets) There exists
$\varepsilon > 0$ (depending on $n$) such that the following holds. Let $X$ be any
random variable taking values in $\{0, 1\}^{[n]}$ with
$H(X) > n - \varepsilon$ and let $T$ be any tree in $\mathbb{T}_n$. If $f$ is
$\mathbb{H}_{X, T}$-optimal, then $f$ is minimally isolating for $T$.

Proof: We make a series of claims.

Claim 1 (Reduction to Uniform) For all $\delta > 0$ there exists
$\varepsilon = \varepsilon(\delta) > 0$ such that if $X$ is a
$\{0, 1\}^{[n]}$-valued random variable with

$$H(X) > n - \varepsilon,$$

and $f : \{0, 1\}^{[n]} \to \{0, 1\}^{V - [n]}$, then

$$|\mathbb{H}_{X, T}(f) - \mathbb{H}_{U, T}(f)| < \delta, \qquad (2)$$

where $U$ is the uniform distribution on $\{0, 1\}^{[n]}$. Therefore, it suffices to
prove Proposition 5 for those $f$ that are $\mathbb{H}_{U, T}$-optimal.

Proof: The entropy of $\{0, 1\}^{[n]}$-valued random variables is maximized
uniquely at $H(U) = n$. The first part of the result follows by continuity of
$H(X)$ and $\mathbb{H}_{X, T}(f)$ in the distribution of $X$.

For the second part, take $\delta > 0$ small enough such that for all $f, f'$, we
have

$$\mathbb{H}_{U, T}(f) > \mathbb{H}_{U, T}(f') \implies \mathbb{H}_{U, T}(f) > \mathbb{H}_{U, T}(f') + 2\delta. \qquad (3)$$

(Recall that there are only finitely many $f$'s for fixed $n$.) Take
$\varepsilon > 0$ such that the first part holds. Then it follows that if $f$ is
$\mathbb{H}_{X, T}$-optimal then it must be $\mathbb{H}_{U, T}$-optimal. We
argue by contradiction. Assume there are $f, f'$ such that
$\mathbb{H}_{X, T}(f) \le \mathbb{H}_{X, T}(f')$ but
$\mathbb{H}_{U, T}(f) > \mathbb{H}_{U, T}(f')$. By (3), we have

$$\mathbb{H}_{U, T}(f) > \mathbb{H}_{U, T}(f') + 2\delta, \qquad (4)$$

which implies $\mathbb{H}_{X, T}(f) > \mathbb{H}_{X, T}(f')$ by (2), a
contradiction. ■

Claim 2 (Minimizer) If $f$ is $\mathbb{H}_{U, T}$-optimal then
$\mathbb{H}_{U, T}(f) = n - 1$. Moreover, denoting $Y = U_f$ we have that
$\{Y_0, (Y_e)_{e \in E}\}$ are mutually independent.

Proof:

Upper Bound We first show that there is $f$ such that
$\mathbb{H}_{U, T}(f) \le n - 1$. Let $S$ be a minimal isolating set for $T$.
Define $f$ by letting $f(\chi)_u = f(\chi)_v$ for all edges $(u, v)$ not in $S$. By
Proposition 3 this uniquely defines $f$. Letting $Y = U_f$ it is immediate to check
that

$$\sum_{e \in E} H(Y_e) = \sum_{e \in S} H(Y_e) \le |S| = n - 1,$$

by Proposition 2.

Lower Bound For any $f : \{0, 1\}^{[n]} \to \{0, 1\}^{V - [n]}$ with $Y = U_f$, we
have

$$n = H(U) = H(\{(Y_v)_{v \in [n]}\}) = H(\{Y_0, (Y_e)_{e \in E}\}) \le H(Y_0) + \sum_{e \in E} H(Y_e) \le 1 + \sum_{e \in E} H(Y_e),$$

where we have used that $\{(Y_v)_{v \in [n]}\}$ and $\{Y_0, (Y_e)_{e \in E}\}$ are
deterministic functions of each other. Furthermore, the first inequality holds
with equality if and only if $\{Y_0, (Y_e)_{e \in E}\}$ are mutually independent. ■

We are ready to conclude the proof of Proposition 5. Let $f$ be
$\mathbb{H}_{U, T}$-optimal with $Y = U_f$. Let $u, v$ be any two leaves of $T$.
We have by the previous claim that $(Y_e)_{e \in \mathrm{Path}_T(u, v)}$ are
mutually independent. Since $Y_u$ and $Y_v$ are independent uniform on
$\{0, 1\}$ it must be that there is an edge $e \in \mathrm{Path}_T(u, v)$ with
$H(Y_e) = 1$. Indeed, define $p_e = \mathbb{P}[Y_e = 1]$ and
$\theta_e = 1 - 2 p_e$. Then by Proposition 4 we have

$$0 = 1 - 2\,\mathbb{P}[Y_u \ne Y_v] = \prod_{e \in \mathrm{Path}_T(u, v)} \theta_e,$$

which implies that at least one $\theta_e = 0$. Let $S'$ be the set of all edges
where $H(Y_e) = 1$. Then we have shown that $S'$ is an isolating set, so that
$|S'| \ge n - 1$ by Proposition 2. Note furthermore that if $e \in S_f$ then
$H(Y_e) \ge H(2^{-n}) > 0$. From $f$'s optimality we obtain

$$n - 1 = \mathbb{H}_{U, T}(f) \ge |S'| + |S_f \setminus S'| \, H(2^{-n}).$$

Therefore we must have $S_f = S'$ and $|S'| = n - 1$, which implies that $S_f$ is a
minimal isolating set as needed. ■

3.2 The Rooted Star is Optimal 

Let $T = (V, E) \in \mathbb{T}_n$ and $S$ a minimal isolating set of $T$. Let
$T^0$ be the tree obtained from $T$ by contracting all edges not in $S$. By
Proposition 3, $T^0$ is an $n$-node tree where each node (leaf or internal) is
(uniquely) labelled by a leaf of $T$. Let $\mathbb{T}_n^0$ be all such trees on $n$
nodes. By Proposition 5, the AML phylogeny estimator is among
$\mathbb{T}_n^0$. Note that for $T \in \mathbb{T}_n^0$ the only possible
extension is trivially $f = 1$ since there are no unlabelled internal vertices.

Proposition 6 (Rooted Star is Optimal) Let $T = (V, E) \in \mathbb{T}_n$. Let
$W$ be the set of leaf edges of $T$, except the edge pendant at 0. Then for
$\varepsilon, \delta > 0$ sufficiently small the following holds. Assume
$X \sim \mathrm{CFN}(T, p)$ with corresponding random cluster
parameterization satisfying $0 < \theta_e < \varepsilon$ for all $e \in W$ and
$1 > \theta_e > 1 - \delta$ for all $e \notin W$. Then, among all trees
$T' \in \mathbb{T}_n^0$, the star rooted at 0 uniquely minimizes
$\mathbb{H}_{X, T'}(1)$.

Proof: We assume that $\delta$ and $\varepsilon$ are small enough so that they
satisfy

$$(n - 1)(1 - \delta)^{2n - 4} > n - 2,$$

and

$$\varepsilon^2 < (n - 1)(1 - \delta)^{2n - 4} - (n - 2). \qquad (5)$$

Let $T' = (V', E') \in \mathbb{T}_n^0$ and $f = 1$ with corresponding variables
$(Y_0, \{Y_e\}_{e \in E'})$ where $Y_0 = X_0$ and
$Y_{u, v} = \mathbb{1}\{X_u \ne X_v\}$. Let $e = (u, v)$ be an edge in $T'$. In
particular, note that $u$ and $v$ are leaves of $T$. Let $p_{u, v}$ be the
probability that $u$ and $v$ disagree and let
$\theta_{u, v} = 1 - 2 p_{u, v}$. We will use the following Taylor expansion of the
entropy around 1/2:

$$H\Big(\frac{1 - \theta}{2}\Big) = 1 - \frac{\theta^2}{2 \ln 2} + O(\theta^4).$$

Note further that

$$H(Y_e) = H(p_{u, v}) = H\Big(\frac{1 - \theta_{u, v}}{2}\Big).$$
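
The quadratic expansion of the binary entropy around 1/2, $H((1 - \theta)/2) \approx 1 - \theta^2 / (2 \ln 2)$, can be sanity-checked numerically. The sketch below compares the exact entropy with the quadratic approximation and reports the error divided by $\theta^4$; the function names and the sample values of $\theta$ are choices made here for illustration.

```python
import math

def entropy(q):
    """Binary entropy in bits, with H(0) = H(1) = 0."""
    if q <= 0.0 or q >= 1.0:
        return 0.0
    return -q * math.log2(q) - (1 - q) * math.log2(1 - q)

def quadratic_approx(theta):
    """Second-order expansion of H((1 - theta) / 2) around theta = 0."""
    return 1 - theta ** 2 / (2 * math.log(2))

# The next term of the series is -theta^4 / (12 ln 2), so the scaled error
# below should stay bounded and close to 1 / (12 ln 2) for small theta.
scaled_errors = {}
for theta in (0.2, 0.1, 0.05):
    exact = entropy((1 - theta) / 2)
    scaled_errors[theta] = (quadratic_approx(theta) - exact) / theta ** 4
```

The bounded ratio in `scaled_errors` is what the $O(\theta^4)$ remainder asserts: the error of the quadratic approximation shrinks like the fourth power of $\theta$.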

As $\varepsilon$ approaches 0, $p_{u, v}$ goes to 1/2. Therefore, by Proposition 4,
up to smaller order terms we want to find $T' = (V', E')$ in $\mathbb{T}_n^0$
that maximizes

$$\Theta(T') := \sum_{e' = (u, v) \in E'} \; \prod_{e \in \mathrm{Path}_T(u, v)} \theta_e^2.$$

If $T'$ has an edge $e'$ between two leaves neither of which is 0, then $e'$ makes
a contribution of at most $\varepsilon^4$ to $\Theta(T')$ since
$\mathrm{Path}_T(u, v)$ crosses two edges in $W$. Therefore, by (5),

$$\Theta(T') \le (n - 2)\varepsilon^2 + \varepsilon^4 < (n - 1)(1 - \delta)^{2n - 4} \varepsilon^2,$$

where we have used that $T'$ has exactly $n - 1$ edges and each edge
$e'' \in E'$ makes a contribution of at most $\varepsilon^2$ since
$\mathrm{Path}_T(u, v)$ contains at least one edge in $W$. On the other hand, the
star rooted at 0, which we denote by $T^*$, is the only tree in $\mathbb{T}_n^0$
which does not include an edge between two leaves neither of which is 0. In that
case, we get

$$\Theta(T^*) \ge (n - 1)(1 - \delta)^{2(n - 2)} \varepsilon^2,$$

where we have used that any path between 0 and another leaf in $T$ contains at
most $n - 2$ edges not in $W$ (since $|E| \le 2n - 3$ and $|W| = n - 1$) and
exactly one edge in $W$. Taking $\varepsilon$ small enough gives the result. ■
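
The comparison in this proof can be illustrated numerically for $n = 4$. The sketch below enumerates all trees on the vertex set $\{0, 1, 2, 3\}$, evaluates for each the sum over its edges $(u, v)$ of the squared product of $\theta_e$ along the path from $u$ to $v$ in the generating tree, and finds the maximizer. The generating quartet tree, the values $\theta_e = 0.09$ on the pendant edges at leaves 1, 2, 3 and $\theta_e = 0.995$ elsewhere, and all function names are hypothetical choices for illustration.

```python
from itertools import combinations

def path_theta(adj, theta, u, v):
    """Product of theta_e along the unique path from u to v in the generating tree."""
    stack = [(u, None, 1.0)]
    while stack:
        node, prev, prod = stack.pop()
        if node == v:
            return prod
        for nxt in adj[node]:
            if nxt != prev:
                stack.append((nxt, node, prod * theta[frozenset((node, nxt))]))
    raise ValueError("v is not reachable from u")

def labelled_trees(n):
    """Yield every tree on vertex set {0, ..., n-1} as a tuple of n - 1 edges."""
    pairs = list(combinations(range(n), 2))
    for candidate in combinations(pairs, n - 1):
        parent = list(range(n))

        def find(x):
            while parent[x] != x:
                x = parent[x]
            return x

        acyclic = True
        for (a, b) in candidate:
            ra, rb = find(a), find(b)
            if ra == rb:                      # edge would close a cycle
                acyclic = False
                break
            parent[ra] = rb
        if acyclic:                           # n - 1 acyclic edges form a tree
            yield candidate

def big_theta(tree_edges, adj, theta):
    """Sum over edges (u, v) of the candidate tree of the squared path product."""
    return sum(path_theta(adj, theta, u, v) ** 2 for (u, v) in tree_edges)

# Generating quartet tree: leaves 0..3, internal vertices "a" and "b".
adj = {0: ["a"], 1: ["a"], 2: ["b"], 3: ["b"],
       "a": [0, 1, "b"], "b": [2, 3, "a"]}
theta = {frozenset((0, "a")): 0.995, frozenset(("a", "b")): 0.995,
         frozenset((1, "a")): 0.09, frozenset((2, "b")): 0.09,
         frozenset((3, "b")): 0.09}

trees = list(labelled_trees(4))
best = max(trees, key=lambda t: big_theta(t, adj, theta))
```

For these parameter values `best` is the star centred at leaf 0, in line with Proposition 6: every other candidate contains an edge joining two leaves other than 0, whose contribution is of order $\varepsilon^4$ rather than $\varepsilon^2$.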
4 Concluding remarks

It would be interesting to extend our results beyond the 2-state case. We note
in particular that for the symmetric $r$-state model, with $r > 2$, the equivalent
formulation of the AML problem given in Definition 3 does not apply. Indeed, it
is easy to check that, instead, one needs to minimize

$$\sum_{e \in E} H\Big(\frac{d_e}{k}\Big) + \log_2 (r - 1) \sum_{e \in E} \frac{d_e}{k},$$

where $d_e$ is the number of sites at which the sequences at the two endpoints of
$e$ differ. The second term (a parsimony "correction") may lead to a different
behavior when $r > 2$.

We thank Peter Ralph for sharing his recent, independent results 031 regarding
the structure of the optimal solution in the 2-state case (similarly to fl2)) as well as
a number of simulations on 4-taxon trees.